03/09/2019

Not every plot is a great plot…

Why do we visualize data?

Three key reasons for visualizing data

  • Exploratory data analysis
  • Checking assumptions
  • Communicating results

Exploratory data analysis (EDA)

  • It’s useful to know what our data look like before we analyze them
  • Can help us avoid potential pitfalls and generate new hypotheses (for future experiments)
  • “There is NO question of teaching exploratory OR confirmatory [analysis] – we need to teach both!” (Tukey, 1980)

Checking assumptions

  • After we fit a statistical model, it’s always a good idea to visualize the residuals to check whether our assumptions have been met
  • Normality: are our residuals normally distributed?
  • Homoscedasticity: is the variance constant across the range of predicted values?
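A minimal sketch of such a residual check, assuming a simple linear model fitted to the built-in mtcars data. The model and the names fit, diag_data, p_resid and p_qq are illustrative choices, not from the slides, and the plotting code uses ggplot2, which is introduced later in this deck:

```r
library(ggplot2)

# Illustrative model (an assumption, not from the slides)
fit <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in one data frame
diag_data <- data.frame(fitted = fitted(fit), residual = resid(fit))

# Homoscedasticity: residuals should scatter evenly around zero
p_resid <- ggplot(data = diag_data, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# Normality: points should fall close to the reference line
p_qq <- ggplot(data = diag_data, aes(sample = residual)) +
  stat_qq() +
  stat_qq_line()

p_resid
p_qq
```

A funnel shape in the first plot would hint at unequal variance; systematic curvature away from the line in the second would hint at non-normal residuals.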

Communicating results

  • If we get exciting results, we want to communicate them effectively to other people (and impress journal editors)!

Plot trivia

The plots were all displaying the same data!

All the plots on top of each other

The data used to generate the plots

ID   Age      Group  Treatment  Outcome
47   Younger  A      Yes        18.06815
187  Younger  B      No         17.18054
191  Younger  B      No         16.74415
41   Younger  A      Yes        17.85739
153  Younger  B      No         19.01744
130  Older    B      Yes        20.68289
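If you want to follow along, the six rows shown here can be typed into a data frame. This is only a sketch: the full data set behind the plots is not reproduced, and the name my_data is an assumption chosen to match the ggplot2 examples in this deck.

```r
# Only the six rows shown on the slide; the full data set is larger
my_data <- data.frame(
  ID        = c(47, 187, 191, 41, 153, 130),
  Age       = c("Younger", "Younger", "Younger", "Younger", "Younger", "Older"),
  Group     = c("A", "B", "B", "A", "B", "B"),
  Treatment = c("Yes", "No", "No", "Yes", "No", "Yes"),
  Outcome   = c(18.06815, 17.18054, 16.74415, 17.85739, 19.01744, 20.68289)
)

# Check the column types before plotting
str(my_data)
```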

Same data, different plots

  • We can display the same data in different ways
  • Different looking plots can have things in common
    • (x-axis, y-axis, colors to identify levels of a variable, data)

ggplot2

ggplot2 and the grammar of graphics

  • ggplot2 is an R package developed by Hadley Wickham
  • It provides a universal “grammar” for making plots
  • It is also built around the idea that different looking plots can have things in common

How ggplot2 works

  • Every plot in ggplot2 is composed of three basic components:
    • Data: the information we want to display in the graph
    • Aesthetics: the scales we want to map our data onto (x-axis, y-axis, colour, etc…)
    • Geoms: the geometric objects we want to represent our data with
  • Any plot can be made by specifying these components!

How to make a ggplot2 plot

ggplot(data = ... , aes(x = ... , y = ...)) +
  geom_ ...
  • Data: specify the data you want to plot
  • Aes(thetics): specify which variables you want to map to the x-axis, y-axis, colour, etc…
  • Geom(etric object): specify which geometric object(s) you want the data represented by
  • That’s it!

Example 1: Make a boxplot

  • We first specify the data, then the x and y axes, and finally the geom
ggplot(data = my_data, aes(x = Group, y = Outcome)) +
  geom_boxplot()

Example 2: Make a violin plot

  • To make a violin plot, we can just swap the geom:
ggplot(data = my_data, aes(x = Group, y = Outcome)) +
  geom_violin(trim = FALSE)

It’s a little bit like Lego

Example 3: Make a scatterplot

  • …and we can do the same for a categorical scatterplot! (I used geom_jitter so that the points do not overlap)
ggplot(data = my_data, aes(x = Group, y = Outcome)) +
  geom_jitter(height = 0, width = 0.1)

Enough with boring plots - what if we want some color?

  • Colour and fill of geoms can be defined as aesthetics in the same way that the x and y axes are
  • This is useful if we have a 3rd variable from our data that we want to show
  • Or we just want to make our plot prettier

Example 4: Redundant fill

  • To distinguish our geoms by fill, we simply need to assign a variable to the fill aesthetic:
ggplot(data = my_data, aes(x = Group, y = Outcome, fill = Group)) +
  geom_boxplot()

Example 5: Redundant colour

  • We can do the same with colour (for 2D geoms like boxplot, colour describes the outline of the geom)
ggplot(data = my_data, aes(x = Group, y = Outcome, colour = Group)) +
  geom_boxplot()

Example 6: Redundant colour for 1D geoms

  • 1D geoms like points or lines do not have fill, only colour (with the default point shape; point shapes 21 to 25 are the exception and take both)
ggplot(data = my_data, aes(x = Group, y = Outcome, colour = Group)) +
  geom_jitter(height = 0, width = 0.1)

Example 7: Extra colour variable

  • We can also make colour represent a meaningful third variable like so:
ggplot(data = my_data, aes(x = Group, y = Outcome, colour = Age)) +
  geom_jitter(height = 0, width = 0.1)

Example 8: Extra fill variable

  • We can use fill to add extra information to our plot, just like colour!
ggplot(data = my_data, aes(x = Group, y = Outcome, fill = Age)) +
  geom_boxplot()

Beyond two dimensions…

  • We can use aesthetics such as colour, fill, linetype, size, etc… to represent additional variables in our 2D plots
  • We can think of it as adding extra dimensions to our plot!
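As a rough sketch of this idea, the plot below maps five variables onto a single 2D scatterplot. It assumes the built-in mtcars data; the choice of variables and the decision to treat cyl and am as categorical are mine, not from the slides.

```r
library(ggplot2)

# mtcars ships with R; factor() turns the numeric codes into categories
p_dims <- ggplot(data = mtcars, aes(x = wt, y = mpg,
                                    colour = factor(cyl),  # 3rd variable
                                    shape = factor(am),    # 4th variable
                                    size = hp)) +          # 5th variable
  geom_point()

p_dims
```

Just because five dimensions fit does not mean they should all be used; a plot like this can quickly become hard to read.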

Remember how I said ggplot2 is like Lego?

Example 9: Beyond one geom

  • We can also stack multiple geoms on top of each other like Lego bricks:
ggplot(data = my_data, aes(x = Group, y = Outcome, 
                           fill = Group, colour = Group)) +
  geom_jitter(height = 0, width = 0.1) +
  geom_boxplot(alpha = 0.5, colour = 'black')

Stacking geoms

  • Stacking geoms is a powerful way to display complex information!
  • For example, you can have one geom represent your actual observed data (e.g. points for data points) and another geom to represent summary statistics (e.g. boxplot displays median, interquartile range, and outliers)
  • If you fit a statistical model to your data (e.g. a linear regression), you can stack it on top of your data points too!
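For instance, a fitted regression line can be stacked on top of the raw points with geom_smooth(). This is a sketch that assumes the built-in mtcars data and a simple linear model (my choices, not from the slides):

```r
library(ggplot2)

p_model <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                            # layer 1: the observed data
  geom_smooth(method = "lm", formula = y ~ x)  # layer 2: linear fit with confidence band

p_model
```

By default geom_smooth() also draws a shaded confidence band around the line, so the plot shows the uncertainty of the fit and not just the fit itself.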

How to change the axis labels?

  • ggplot2 automatically uses the names of the variables you specify in aesthetics as the axis labels
  • You can change these via the labs() function:
ggplot(data = my_data, aes(x = Group, y = Outcome, 
                           fill = Group, colour = Group)) +
  geom_jitter(height = 0, width = 0.1) +
  geom_boxplot(alpha = 0.5, colour = 'black') +
  labs(title = 'This is objectively the best plot', 
       x = 'Some letters', y = 'How much you should love this plot')

How to change axis limits, breaks?

  • ggplot2 automatically picks what it thinks are the most useful axis limits & breaks
  • You can add a scale function to change these:
ggplot(data = my_data, aes(x = Group, y = Outcome, 
                           fill = Group, colour = Group)) +
  geom_jitter(height = 0, width = 0.1) +
  geom_boxplot(alpha = 0.5, colour = 'black') +
  scale_y_continuous(limits = c(0, 50), breaks = seq(0, 50, by = 5))
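One caveat worth knowing: limits set in a scale function remove observations that fall outside them before statistics such as boxplot quartiles are computed, whereas coord_cartesian() only zooms the view. A small sketch, assuming the built-in mtcars data and deliberately narrow limits (both my choices, not from the slides):

```r
library(ggplot2)

base_plot <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Scale limits: values outside 15-30 are dropped (ggplot2 warns about it),
# so the boxes are computed from the remaining data only
p_scale <- base_plot + scale_y_continuous(limits = c(15, 30))

# Coordinate limits: boxes are computed from all the data, then the view is zoomed
p_zoom <- base_plot + coord_cartesian(ylim = c(15, 30))

p_zoom
```

With limits as wide as c(0, 50) in the example above nothing is dropped, so the two approaches would look the same there.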

Why hasn’t this guy received any love yet?

Bar plots are evil!

  • Many statisticians and data vis experts strongly advocate AGAINST the use of bar plots (aka “dynamite plots”)
  • ggplot2 does not have a dedicated geom for dynamite plots, i.e. bars of means with error bars (although you can still make them, as you can see)
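For the curious, one possible way to build a dynamite plot is with stat_summary(). This sketch assumes the built-in mtcars data and ggplot2 3.3.0 or later (for the fun argument); it is not necessarily how the plot on the slide was made.

```r
library(ggplot2)

p_dynamite <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  # Bar height = group mean
  stat_summary(fun = mean, geom = "bar") +
  # Error bars = mean +/- 1 standard error
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)

p_dynamite
```

Note that each group is reduced to just two numbers (a mean and a standard error), which is exactly the criticism that follows.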

Why are bar/dynamite plots evil?

  • Dynamite plots look like they show more information than they really do:

  • The area of the bar graph doesn’t represent any meaningful information - it’s just wasting ink!

Bar plots also hide unequal distributions of data

So what are bar plots good for?

  • Bar plots were designed to show counts of discrete data
  • That’s it!
  • For everything else, use boxplots, scatterplots, kernel density estimates, etc…
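A count-based bar plot is what geom_bar() gives you by default. A minimal sketch, assuming the diamonds data that ships with ggplot2 (the choice of the cut variable is mine):

```r
library(ggplot2)

# geom_bar() counts the rows in each category, so only x needs to be specified
p_counts <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar()

p_counts
```

Here the height of each bar is the number of diamonds with that cut, so the bar actually stands for something.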

So what should we do?

Principles of graphical excellence (Tufte, 2001)

  • Graphical excellence is the well-designed presentation of interesting data - a matter of substance, statistics, and design
  • Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency
  • Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
  • Graphical excellence is nearly always multivariate
  • And graphical excellence requires telling the truth about the data

Principles for making good plots

  • Keep it simple, less is more
  • Show your data, not just summaries (whenever possible)
  • Show uncertainty, not just your model
  • Find the right geoms to represent your information (e.g. boxplots for grouped continuous data, line plots for time-series,…)
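A sketch of how "show your data" and "show uncertainty" can be combined in one plot, assuming the built-in mtcars data; using mean plus/minus one standard error as the summary is my choice, and other intervals may suit your data better:

```r
library(ggplot2)

p_good <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg)) +
  # The raw observations, jittered sideways so they do not overlap
  geom_jitter(height = 0, width = 0.1, alpha = 0.5) +
  # Group mean with +/- 1 standard error on top
  stat_summary(fun.data = mean_se, geom = "pointrange", colour = "red")

p_good
```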

To summarize:

  • You can show the same data in different ways
  • Some ways of showing the data are better than others (boxplots > dynamite plots)
  • ggplot2 allows you to easily switch between different ways of plotting the same data and build your plots piece by piece like Lego

Congratulations! You are now a ggplot2 expert!

Time to plot

  • You can now go and make some ggplot2 plots of your own!
  • There are multiple pre-loaded datasets that you can explore & visualize. The handout uses the diamonds dataset; other good options are the mtcars, iris, & starwars datasets (starwars comes with the dplyr package).
  • To see what variables are included in a dataset, use either the str() or glimpse() function.
  • For more info, see the handout. If you’ve brought your own data, yell at me and I’ll come help you load it into R.
  • Happy plotting!
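A quick sketch of how to inspect a dataset before plotting it. str() is part of base R; glimpse() is assumed here to come from the dplyr package, which also provides the starwars data, so that part only runs if dplyr is installed:

```r
# str() works out of the box
str(mtcars)

# glimpse() and starwars need dplyr; skip quietly if it is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(dplyr::starwars)
}
```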